{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# SQL Queries with Differential Privacy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Read data\n",
    "Smartnoise supports issue SQL queries against CSV files, database engines, and Spark clusters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     age  sex  educ  race   income  married\n",
      "0     59    1     9     1      0.0        1\n",
      "1     31    0     1     3  17000.0        0\n",
      "2     36    1    11     1      0.0        1\n",
      "3     54    1    11     1   9100.0        1\n",
      "4     39    0     5     3  37000.0        0\n",
      "..   ...  ...   ...   ...      ...      ...\n",
      "995   73    0     3     3  24200.0        0\n",
      "996   38    1     2     3      0.0        0\n",
      "997   50    0    13     1  22000.0        1\n",
      "998   44    1    14     4    500.0        1\n",
      "999   29    1    11     1  66400.0        0\n",
      "\n",
      "[1000 rows x 6 columns]\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "pums = pd.read_csv('../../datasets/PUMS.csv')\n",
    "\n",
    "print(pums)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Execute DP query\n",
    "\n",
    "Open a private SQL connection"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  married    n\n",
      "0       0  454\n",
      "1       1  548\n"
     ]
    }
   ],
   "source": [
    "from snsql import *\n",
    "\n",
    "metadata = '../../datasets/PUMS.yaml'\n",
    "\n",
    "\n",
    "private_reader = from_connection(\n",
    "    pums, metadata=metadata, \n",
    "    privacy=Privacy(epsilon=1.0, delta=1/1000)\n",
    ")\n",
    "\n",
    "query = 'SELECT married, COUNT(*) AS n FROM PUMS.PUMS GROUP BY married'\n",
    "\n",
    "result_dp = private_reader.execute_df(query)\n",
    "print(result_dp)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Note**, in the above example, we query the PUMS microdata to get the count of individuals by marriage status.  If you run the private query repeatedly, you will see that the answer changes a bit between queries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  married    n\n",
      "0       0  449\n",
      "1       1  549\n"
     ]
    }
   ],
   "source": [
    "result_dp = private_reader.execute_df(query)\n",
    "print(result_dp)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `PrivateReader` allows you to swap in differentially private results wherever exact results are currently used, it can accept some additional paramaters to control privacy/accuracy tradeoff.  Smaller epsilon will be more private, but less accurate. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "epsilon is: 4.0\n",
      "  married    n\n",
      "0       0  451\n",
      "1       1  548\n",
      "\n",
      "epsilon is: 0.1\n",
      "  married    n\n",
      "0       0  453\n",
      "1       1  534\n",
      "\n"
     ]
    }
   ],
   "source": [
    "for epsilon in [4.0, 0.1]:\n",
    "    private_reader = from_connection(\n",
    "        pums, metadata=metadata, \n",
    "        privacy=Privacy(epsilon=epsilon, delta=1/1000)\n",
    "    )\n",
    "    print(f\"epsilon is: {epsilon}\")\n",
    "    result = private_reader.execute_df(query)\n",
    "    print(result)\n",
    "    print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Execute normal query \n",
    "Calling the underlying `Reader` directly will give the exact result."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   married    n\n",
      "0        0  451\n",
      "1        1  549\n"
     ]
    }
   ],
   "source": [
    "result = private_reader.reader.execute_df(query)\n",
    "\n",
    "print(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Metadata file\n",
    "\n",
    "The `PrivateReader` needs some metadata that describes the data source.  Differentially private processing needs to know which columns can be used in numeric computations, as well as information about the sensitivity of data, and which column is the private identifier.  Metadata should be provided by the data owner, and should not be data-dependent.  For example, the acceptable range for the `age` column should be domain-specific, and should not use the actual minimum and maximum values from the data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "PUMS.PUMS [1000 rows]\n",
      "\tage [int] (0,100)\n",
      "\tsex (card: 0)\n",
      "\teduc (card: 0)\n",
      "\trace (card: 0)\n",
      "\tincome [int] (0,500000)\n",
      "\tmarried (card: 0)\n"
     ]
    }
   ],
   "source": [
    "import snsql\n",
    "meta = snsql.metadata.Metadata.from_file('../../datasets/PUMS.yaml')\n",
    "print(meta)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
